{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Preparation\n",
    "\n",
    "We do the folloiwng:\n",
    "\n",
    "1. Install [EVAR (Evaluation package for Audio Representations)](https://github.com/nttcslab/eval-audio-repr) and a program for training and testing on the CirCor dataset using various pre-trained audio representations.\n",
    "2. Clone heart-murmur-detection repository.\n",
    "3. Download dataset.\n",
    "4. Create stratified splits.\n",
    "5. Make the training data for EVAR.\n",
    "6. Finally, check the data integrity."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import numpy as np\n",
    "from pathlib import Path\n",
    "import shutil\n",
    "import librosa\n",
    "import torch\n",
    "import torchaudio"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "## CLEAN UP -- Uncomment and run this step when you need to clearn up.\n",
    "# ! rm -fr evar heart-murmur-detection"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1. Code setup: EVAR and a main program\n",
    "\n",
    "Our experiments run using [EVAR (Evaluation package for Audio Representations)](https://github.com/nttcslab/eval-audio-repr) and use a main program, `circor_eval.py`, on top of it.\n",
    "\n",
    "The following does:\n",
    "- Clone EVAR and download the additional Python codes.\n",
    "- Apply a patch on EVAR to extend it to the heart murmur detection tasks (circor1 to 3).\n",
    "- Copy a program, `circor_eval.py`, that performs training and tests a model on the task."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Cloning into 'evar'...\n",
      "remote: Enumerating objects: 355, done.\u001b[K\n",
      "remote: Counting objects: 100% (230/230), done.\u001b[K\n",
      "remote: Compressing objects: 100% (152/152), done.\u001b[K\n",
      "remote: Total 355 (delta 158), reused 127 (delta 78), pack-reused 125\u001b[K\n",
      "Receiving objects: 100% (355/355), 894.25 KiB | 11.46 MiB/s, done.\n",
      "Resolving deltas: 100% (220/220), done.\n",
      "Note: switching to '75eedb4e4c4628ac5478c1a975abe1969beaf291'.\n",
      "\n",
      "You are in 'detached HEAD' state. You can look around, make experimental\n",
      "changes and commit them, and you can discard any commits you make in this\n",
      "state without impacting any branches by switching back to a branch.\n",
      "\n",
      "If you want to create a new branch to retain commits you create, you may\n",
      "do so (now or later) by using -c with the switch command. Example:\n",
      "\n",
      "  git switch -c <new-branch-name>\n",
      "\n",
      "Or undo this operation with:\n",
      "\n",
      "  git switch -\n",
      "\n",
      "Turn off this advice by setting config variable advice.detachedHead to false\n",
      "\n",
      "HEAD is now at 75eedb4 Update m2d.yaml\n",
      "  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current\n",
      "                                 Dload  Upload   Total   Spent    Left  Speed\n",
      "100 17199  100 17199    0     0   987k      0 --:--:-- --:--:-- --:--:--  987k\n",
      "  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current\n",
      "                                 Dload  Upload   Total   Spent    Left  Speed\n",
      "100  3331  100  3331    0     0   203k      0 --:--:-- --:--:-- --:--:--  203k\n",
      "  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current\n",
      "                                 Dload  Upload   Total   Spent    Left  Speed\n",
      "100  6092  100  6092    0     0   371k      0 --:--:-- --:--:-- --:--:--  371k\n",
      "patching file evar/ds_tasks.py\n",
      "patching file finetune.py\n"
     ]
    }
   ],
   "source": [
    "! git clone https://github.com/nttcslab/eval-audio-repr.git evar\n",
    "! (cd evar && git checkout 75eedb4e4c4628ac5478c1a975abe1969beaf291)\n",
    "! (cd evar && curl https://raw.githubusercontent.com/daisukelab/general-learning/master/MLP/torch_mlp_clf2.py -o evar/utils/torch_mlp_clf2.py)\n",
    "! (cd evar && curl https://raw.githubusercontent.com/daisukelab/sound-clf-pytorch/master/for_evar/sampler.py -o evar/sampler.py)\n",
    "! (cd evar && curl https://raw.githubusercontent.com/daisukelab/sound-clf-pytorch/master/for_evar/cnn14_decoupled.py -o evar/cnn14_decoupled.py)\n",
    "! (cd evar && patch -p1 < ../diff-evar.patch)\n",
    "! (cd evar/external && ln -s ../../../.. m2d)\n",
    "! cp circor_eval.py evar/"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 1-1 Setup pre-trained models\n",
    "\n",
    "For the experiment using M2D, please download the weight file.\n",
    "\n",
    "```sh\n",
    "wget https://github.com/nttcslab/m2d/releases/download/v0.1.0/m2d_vit_base-80x608p16x16-221006-mr7_enconly.zip\n",
    "unzip m2d_vit_base-80x608p16x16-221006-mr7_enconly.zip\n",
    "```\n",
    "\n",
    "For the setup of AST and BYOL-A, please follow the steps in [Preparing-models.md](https://github.com/nttcslab/eval-audio-repr/blob/main/Preparing-models.md) in EVAR."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2. Clone heart-murmur-detection repository\n",
    "\n",
    "Thanks to the repository: https://github.com/Benjamin-Walker/heart-murmur-detection\n",
    "\n",
    "- We use this commit: `https://github.com/Benjamin-Walker/heart-murmur-detection/tree/60f5420918b151e06932f70a52649d9562f0be2d`\n",
    "- Then, we make local modifications using `diff-heart-murmur-detection.patch`\n",
    "\n",
    "```bibtex\n",
    "@article{walker2022DBResNet,\n",
    "    title={Dual Bayesian ResNet: A Deep Learning Approach to Heart Murmur Detection},\n",
    "    author={Benjamin Walker and Felix Krones and Ivan Kiskin and Guy Parsons and Terry Lyons and Adam Mahdi},\n",
    "    journal={Computing in Cardiology},\n",
    "    volume={49},\n",
    "    year={2022}\n",
    "}\n",
    "```\n",
    "\n",
    "We used this repository to create stratified data splits and perform the evaluation in exactly the same way as Walker et al."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Cloning into 'heart-murmur-detection'...\n",
      "remote: Enumerating objects: 398, done.\u001b[K\n",
      "remote: Counting objects: 100% (398/398), done.\u001b[K\n",
      "remote: Compressing objects: 100% (261/261), done.\u001b[K\n",
      "remote: Total 398 (delta 229), reused 266 (delta 115), pack-reused 0\u001b[K\n",
      "Receiving objects: 100% (398/398), 2.35 MiB | 10.92 MiB/s, done.\n",
      "Resolving deltas: 100% (229/229), done.\n",
      "Note: switching to '60f5420918b151e06932f70a52649d9562f0be2d'.\n",
      "\n",
      "You are in 'detached HEAD' state. You can look around, make experimental\n",
      "changes and commit them, and you can discard any commits you make in this\n",
      "state without impacting any branches by switching back to a branch.\n",
      "\n",
      "If you want to create a new branch to retain commits you create, you may\n",
      "do so (now or later) by using -c with the switch command. Example:\n",
      "\n",
      "  git switch -c <new-branch-name>\n",
      "\n",
      "Or undo this operation with:\n",
      "\n",
      "  git switch -\n",
      "\n",
      "Turn off this advice by setting config variable advice.detachedHead to false\n",
      "\n",
      "HEAD is now at 60f5420 Update README.md\n",
      "patching file heart-murmur-detection/ModelEvaluation/evaluate_model.py\n"
     ]
    }
   ],
   "source": [
    "! git clone https://github.com/Benjamin-Walker/heart-murmur-detection.git\n",
    "! (cd heart-murmur-detection && git checkout 60f5420918b151e06932f70a52649d9562f0be2d)\n",
    "! patch -p1 < diff-heart-murmur-detection.patch"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3. Download dataset\n",
    "\n",
    "We download `physionet.org/files/circor-heart-sound/1.0.3/` as in the previous studies.\n",
    "\n",
    "**NOTE: Downloading takes about 1 hour depending on the network condition.**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "# ! wget -r -N -c -np https://physionet.org/files/circor-heart-sound/1.0.3/"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "... you will see the downloading logs here ..."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 4. Create stratified splits\n",
    "\n",
    "This step creates the stratified splits used in our experiments by using `datalist_stratified_data1~3.csv` files, whic are the exact list of the files for each split.\n",
    "\n",
    "The resulting files are copied in the `heart-murmur-detection/data` folder.\n",
    "\n",
    "### Notes for how we made the splits\n",
    "\n",
    "To create stratified splits, we used the code from the repository `heart-murmur-detection` with modified options `--vali_size 0.1 --test_size 0.25`, for making training/validation/test sets with a proportion of 65:10:25.\n",
    "\n",
    "- Running `main.py` creates stratified data splits under `data/stratified_data` (`--stratified_directory`).\n",
    "- Then, we ran followings and got three splits under `data/stratified_data1`, `data/stratified_data2`, and `data/stratified_data3`.\n",
    "    ```sh\n",
    "    CUDA_VISIBLE_DEVICES=0 python main.py --full_data_directory physionet.org/files/circor-heart-sound/1.0.3/training_data --stratified_directory data/stratified_data1 --vali_size 0.1 --test_size 0.25 --random_state 14 --recalc_features --spectrogram_directory data/spectrograms1 --model_name resnet50dropout --recalc_output --dbres_output_directory outputs1\n",
    "    CUDA_VISIBLE_DEVICES=0 python main.py --full_data_directory physionet.org/files/circor-heart-sound/1.0.3/training_data --stratified_directory data/stratified_data2 --vali_size 0.1 --test_size 0.25 --random_state 42 --recalc_features --spectrogram_directory data/spectrograms2 --model_name resnet50dropout --recalc_output --dbres_output_directory outputs2\n",
    "    CUDA_VISIBLE_DEVICES=1 python main.py --full_data_directory physionet.org/files/circor-heart-sound/1.0.3/training_data --stratified_directory data/stratified_data3 --vali_size 0.1 --test_size 0.25 --random_state 84 --recalc_features --spectrogram_directory data/spectrograms3 --model_name resnet50dropout --recalc_output --dbres_output_directory outputs3\n",
    "    ```\n",
    "- After creating these folders, we made lists of the stratified files as datalist_stratified_data1.csv, datalist_stratified_data2.csv, and datalist_stratified_data3.csv."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "from pathlib import Path\n",
    "import shutil\n",
    "\n",
    "split_csvs = ['./datalist_stratified_data1.csv', './datalist_stratified_data2.csv', './datalist_stratified_data3.csv']\n",
    "df = pd.concat([pd.read_csv(f) for f in split_csvs], ignore_index=True)\n",
    "\n",
    "dest = Path('heart-murmur-detection/data')\n",
    "for f in df.dest_file.values:\n",
    "    f = Path(f)\n",
    "    f.parent.mkdir(exist_ok=True, parents=True)\n",
    "    from_file = Path('physionet.org/files/circor-heart-sound/1.0.3/training_data')/f.name\n",
    "    #print('Copy', from_file, 'to', f)\n",
    "    shutil.copy(from_file, f)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 5. Convert the data files for fine-tuning\n",
    "\n",
    "We need to convert the original data samples (*.wav) into 5-s segments:\n",
    "- Process 3 stratified data splits independently.\n",
    "- Source filres are from the `heart-murmur-detection/data/stratified_dataX` folder.\n",
    "- Converted into the `evar/work/16k/circorX` folder.\n",
    "- All (long) source files are cropped into 5-s segments with a window duration of 5 s and stride of 2.5 s.\n",
    "\n",
    "The evaluation package `EVAR` uses data samples under `evar/work/16k`, while the final test result calculation using the code from `heart-murmur-detection` uses the files in `heart-murmur-detection/data`.\n",
    "\n",
    "The metadata files will also be created as `evar/evar/metadata/circorX.csv`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Split #1 has samples: Training:611(64.86%), Val:95(10.08%), Test:236(25.05%)\n",
      " Training sample IDs are: [2530, 9979, 9983] ...\n",
      "Split 1\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>file_name</th>\n",
       "      <th>label</th>\n",
       "      <th>split</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>2530_AV_0</th>\n",
       "      <td>train/2530_AV_0.wav</td>\n",
       "      <td>Absent</td>\n",
       "      <td>train</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2530_AV_1</th>\n",
       "      <td>train/2530_AV_1.wav</td>\n",
       "      <td>Absent</td>\n",
       "      <td>train</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2530_AV_2</th>\n",
       "      <td>train/2530_AV_2.wav</td>\n",
       "      <td>Absent</td>\n",
       "      <td>train</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                     file_name   label  split\n",
       "2530_AV_0  train/2530_AV_0.wav  Absent  train\n",
       "2530_AV_1  train/2530_AV_1.wav  Absent  train\n",
       "2530_AV_2  train/2530_AV_2.wav  Absent  train"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Split #2 has samples: Training:611(64.86%), Val:95(10.08%), Test:236(25.05%)\n",
      " Training sample IDs are: [9979, 9983, 14241] ...\n",
      "Split 2\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>file_name</th>\n",
       "      <th>label</th>\n",
       "      <th>split</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>2530_AV_0</th>\n",
       "      <td>valid/2530_AV_0.wav</td>\n",
       "      <td>Absent</td>\n",
       "      <td>valid</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2530_AV_1</th>\n",
       "      <td>valid/2530_AV_1.wav</td>\n",
       "      <td>Absent</td>\n",
       "      <td>valid</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2530_AV_2</th>\n",
       "      <td>valid/2530_AV_2.wav</td>\n",
       "      <td>Absent</td>\n",
       "      <td>valid</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                     file_name   label  split\n",
       "2530_AV_0  valid/2530_AV_0.wav  Absent  valid\n",
       "2530_AV_1  valid/2530_AV_1.wav  Absent  valid\n",
       "2530_AV_2  valid/2530_AV_2.wav  Absent  valid"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Split #3 has samples: Training:611(64.86%), Val:95(10.08%), Test:236(25.05%)\n",
      " Training sample IDs are: [2530, 9979, 24160] ...\n",
      "Split 3\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>file_name</th>\n",
       "      <th>label</th>\n",
       "      <th>split</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>2530_AV_0</th>\n",
       "      <td>train/2530_AV_0.wav</td>\n",
       "      <td>Absent</td>\n",
       "      <td>train</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2530_AV_1</th>\n",
       "      <td>train/2530_AV_1.wav</td>\n",
       "      <td>Absent</td>\n",
       "      <td>train</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2530_AV_2</th>\n",
       "      <td>train/2530_AV_2.wav</td>\n",
       "      <td>Absent</td>\n",
       "      <td>train</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                     file_name   label  split\n",
       "2530_AV_0  train/2530_AV_0.wav  Absent  train\n",
       "2530_AV_1  train/2530_AV_1.wav  Absent  train\n",
       "2530_AV_2  train/2530_AV_2.wav  Absent  train"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Patient ID</th>\n",
       "      <th>Recording locations:</th>\n",
       "      <th>Age</th>\n",
       "      <th>Sex</th>\n",
       "      <th>Height</th>\n",
       "      <th>Weight</th>\n",
       "      <th>Pregnancy status</th>\n",
       "      <th>Murmur</th>\n",
       "      <th>Murmur locations</th>\n",
       "      <th>Most audible location</th>\n",
       "      <th>...</th>\n",
       "      <th>Systolic murmur quality</th>\n",
       "      <th>Diastolic murmur timing</th>\n",
       "      <th>Diastolic murmur shape</th>\n",
       "      <th>Diastolic murmur grading</th>\n",
       "      <th>Diastolic murmur pitch</th>\n",
       "      <th>Diastolic murmur quality</th>\n",
       "      <th>Outcome</th>\n",
       "      <th>Campaign</th>\n",
       "      <th>Additional ID</th>\n",
       "      <th>split</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>2530</td>\n",
       "      <td>AV+PV+TV+MV</td>\n",
       "      <td>Child</td>\n",
       "      <td>Female</td>\n",
       "      <td>98.0</td>\n",
       "      <td>15.9</td>\n",
       "      <td>False</td>\n",
       "      <td>Absent</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>...</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Abnormal</td>\n",
       "      <td>CC2015</td>\n",
       "      <td>NaN</td>\n",
       "      <td>train</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>9979</td>\n",
       "      <td>AV+PV+TV+MV</td>\n",
       "      <td>Child</td>\n",
       "      <td>Female</td>\n",
       "      <td>103.0</td>\n",
       "      <td>13.1</td>\n",
       "      <td>False</td>\n",
       "      <td>Present</td>\n",
       "      <td>AV+MV+PV+TV</td>\n",
       "      <td>TV</td>\n",
       "      <td>...</td>\n",
       "      <td>Harsh</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Abnormal</td>\n",
       "      <td>CC2015</td>\n",
       "      <td>NaN</td>\n",
       "      <td>train</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>9983</td>\n",
       "      <td>AV+PV+TV+MV</td>\n",
       "      <td>Child</td>\n",
       "      <td>Male</td>\n",
       "      <td>115.0</td>\n",
       "      <td>19.1</td>\n",
       "      <td>False</td>\n",
       "      <td>Unknown</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>...</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Abnormal</td>\n",
       "      <td>CC2015</td>\n",
       "      <td>NaN</td>\n",
       "      <td>test</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>3 rows × 24 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "   Patient ID Recording locations:    Age     Sex  Height  Weight  \\\n",
       "0        2530          AV+PV+TV+MV  Child  Female    98.0    15.9   \n",
       "1        9979          AV+PV+TV+MV  Child  Female   103.0    13.1   \n",
       "2        9983          AV+PV+TV+MV  Child    Male   115.0    19.1   \n",
       "\n",
       "   Pregnancy status   Murmur Murmur locations Most audible location  ...  \\\n",
       "0             False   Absent              NaN                   NaN  ...   \n",
       "1             False  Present      AV+MV+PV+TV                    TV  ...   \n",
       "2             False  Unknown              NaN                   NaN  ...   \n",
       "\n",
       "  Systolic murmur quality Diastolic murmur timing Diastolic murmur shape  \\\n",
       "0                     NaN                     NaN                    NaN   \n",
       "1                   Harsh                     NaN                    NaN   \n",
       "2                     NaN                     NaN                    NaN   \n",
       "\n",
       "  Diastolic murmur grading Diastolic murmur pitch Diastolic murmur quality  \\\n",
       "0                      NaN                    NaN                      NaN   \n",
       "1                      NaN                    NaN                      NaN   \n",
       "2                      NaN                    NaN                      NaN   \n",
       "\n",
       "    Outcome Campaign Additional ID  split  \n",
       "0  Abnormal   CC2015           NaN  train  \n",
       "1  Abnormal   CC2015           NaN  train  \n",
       "2  Abnormal   CC2015           NaN   test  \n",
       "\n",
       "[3 rows x 24 columns]"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "dfs = []\n",
    "\n",
    "for split_no in [1, 2, 3]:\n",
    "    trn = sorted(Path(f'heart-murmur-detection/data/stratified_data{split_no}/train_data/').glob('*.wav'))\n",
    "    val = sorted(Path(f'heart-murmur-detection/data/stratified_data{split_no}/vali_data/').glob('*.wav'))\n",
    "    tst = sorted(Path(f'heart-murmur-detection/data/stratified_data{split_no}/test_data/').glob('*.wav'))\n",
    "    #Tr, V, Te = len(trn), len(val), len(tst)\n",
    "\n",
    "    itrn = sorted(list(set([int(f.stem.split('_')[0]) for f in trn])))\n",
    "    ival = sorted(list(set([int(f.stem.split('_')[0]) for f in val])))\n",
    "    itst = sorted(list(set([int(f.stem.split('_')[0]) for f in tst])))\n",
    "    Tr, V, Te = len(itrn), len(ival), len(itst)\n",
    "    N = Tr + V + Te\n",
    "    print(f'Split #{split_no} has samples: Training:{Tr}({Tr/N*100:.2f}%), Val:{V}({V/N*100:.2f}%), Test:{Te}({Te/N*100:.2f}%)')\n",
    "    print(' Training sample IDs are:', itrn[:3], '...')\n",
    "\n",
    "    df = pd.read_csv('physionet.org/files/circor-heart-sound/1.0.3/training_data.csv')\n",
    "\n",
    "    def get_split(pid):\n",
    "        if pid in itrn: return 'train'\n",
    "        if pid in ival: return 'valid'\n",
    "        if pid in itst: return 'test'\n",
    "        assert False, f'Patient ID {pid} Unknown'\n",
    "    df['split'] = df['Patient ID'].apply(get_split)\n",
    "\n",
    "\n",
    "    SR = 16000\n",
    "    L = int(SR * 5.0)\n",
    "    STEP = int(SR * 2.5)\n",
    "\n",
    "    ROOT = Path('physionet.org/files/circor-heart-sound/1.0.3/training_data/')\n",
    "    TO_FOLDER = Path(f'evar/work/16k/circor{split_no}')\n",
    "\n",
    "    evardf = pd.DataFrame()\n",
    "\n",
    "    for i, r in df.iterrows():\n",
    "        pid, recloc, split, label = str(r['Patient ID']), r['Recording locations:'], r.split, r.Murmur\n",
    "        # Not using recloc. Search real recordings...\n",
    "        recloc = [f.stem.replace(pid+'_', '') for f in sorted(ROOT.glob(f'{pid}_*.wav'))]\n",
    "        #print(pid, recloc, split, label)\n",
    "        for rl in recloc:\n",
    "            wav, sr = librosa.load(f'{ROOT}/{pid}_{rl}.wav', sr=SR)\n",
    "            for widx, pos in enumerate(range(0, len(wav) - STEP + 1, STEP)):\n",
    "                w = wav[pos:pos+L]\n",
    "                org_len = len(w)\n",
    "                if org_len < L:\n",
    "                    w = np.pad(w, (0, L - org_len))\n",
    "                    assert len(w) == L\n",
    "                to_name = TO_FOLDER/split/f'{pid}_{rl}_{widx}.wav'\n",
    "                to_rel_name = to_name.relative_to(TO_FOLDER)\n",
    "                #print(pid, rl, len(wav)/SR, to_name, to_rel_name, org_len, len(w), pos)\n",
    "                evardf.loc[to_name.stem, 'file_name'] = to_rel_name\n",
    "                evardf.loc[to_name.stem, 'label'] = label\n",
    "                evardf.loc[to_name.stem, 'split'] = split\n",
    "\n",
    "                to_name.parent.mkdir(exist_ok=True, parents=True)\n",
    "                w = torch.tensor(w * 32767.0).to(torch.int16).unsqueeze(0)\n",
    "                torchaudio.save(to_name, w, SR)\n",
    "    evardf.to_csv(f'evar/evar/metadata/circor{split_no}.csv', index=None)\n",
    "    print('Split', split_no)\n",
    "    display(evardf[:3])\n",
    "\n",
    "df[:3]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(27361,\n",
       "                        file_name   label  split\n",
       " 2530_AV_0    train/2530_AV_0.wav  Absent  train\n",
       " 2530_AV_1    train/2530_AV_1.wav  Absent  train\n",
       " 2530_AV_2    train/2530_AV_2.wav  Absent  train\n",
       " 2530_AV_3    train/2530_AV_3.wav  Absent  train\n",
       " 2530_AV_4    train/2530_AV_4.wav  Absent  train\n",
       " ...                          ...     ...    ...\n",
       " 85349_TV_2  train/85349_TV_2.wav  Absent  train\n",
       " 85349_TV_3  train/85349_TV_3.wav  Absent  train\n",
       " 85349_TV_4  train/85349_TV_4.wav  Absent  train\n",
       " 85349_TV_5  train/85349_TV_5.wav  Absent  train\n",
       " 85349_TV_6  train/85349_TV_6.wav  Absent  train\n",
       " \n",
       " [27361 rows x 3 columns])"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Example of evar/work/16k/circor3\n",
    "len(evardf), evardf"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 6. Final steps: Check the data integrity\n",
    "\n",
    "### 6-1. Original data check\n",
    "\n",
    "1. \"The data consists of samples for classes Present/Absent/Unknown of 179/695/68.\" "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Classes are: ['Absent' 'Present' 'Unknown']\n",
      "Original circor-heart-sound/1.0.3/training_data.csv has samples: Absent:695(73.78%), Present:179(19.00%), Unknown:68(7.22%)\n"
     ]
    }
   ],
   "source": [
    "df = pd.read_csv('./physionet.org/files/circor-heart-sound/1.0.3/training_data.csv')\n",
    "\n",
    "print('Classes are:', df.Murmur.unique())\n",
    "\n",
    "A, P, U = [sum(df.Murmur == s) for s in ['Absent', 'Present', 'Unknown']]\n",
    "N = len(df)\n",
    "print(f'Original circor-heart-sound/1.0.3/training_data.csv has samples: Absent:{A}({A/N*100:.2f}%), Present:{P}({P/N*100:.2f}%), Unknown:{U}({U/N*100:.2f}%)')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "2. \"Each sample consists of multiple recordings of variable-length audio, and there are 3,163 recordings in total.\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "3163\n"
     ]
    }
   ],
   "source": [
    "! find physionet.org -name *.wav |wc -l"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "3. \"The data were split with stratification by class labels into training/validation/test sets with a proportion of 65:10:25.\" -- The actual splits should be close to these numbers.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Split ./datalist_stratified_data1.csv has samples: Training:2038(64.43%), Val:324(10.24%), Test:801(25.32%), total:3163\n",
      "Split ./datalist_stratified_data2.csv has samples: Training:2069(65.41%), Val:301(9.52%), Test:793(25.07%), total:3163\n",
      "Split ./datalist_stratified_data3.csv has samples: Training:2074(65.57%), Val:316(9.99%), Test:773(24.44%), total:3163\n"
     ]
    }
   ],
   "source": [
    "# Checking the preset stratification statistics.\n",
    "\n",
    "split_csvs = ['./datalist_stratified_data1.csv', './datalist_stratified_data2.csv', './datalist_stratified_data3.csv']\n",
    "for f in split_csvs:\n",
    "    df = pd.read_csv(f)\n",
    "    df = df[df.dest_file.str.endswith('.wav')]\n",
    "    Tr, V, Te = [sum(df.dest_file.str.contains(s)) for s in ['/train_data/', '/vali_data/', '/test_data/']]\n",
    "    N = len(df)\n",
    "    assert N == (Tr + V + Te)\n",
    "    print(f'Split {f} has samples: Training:{Tr}({Tr/N*100:.2f}%), Val:{V}({V/N*100:.2f}%), Test:{Te}({Te/N*100:.2f}%), total:{Tr+V+Te}')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 6-2. Fine-tuning data check"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Split evar/evar/metadata/circor1.csv has samples: Training:17746(64.86%), Val:2857(10.44%), Test:6758(24.70%)\n",
      "Split evar/evar/metadata/circor2.csv has samples: Training:17986(65.74%), Val:2570(9.39%), Test:6805(24.87%)\n",
      "Split evar/evar/metadata/circor3.csv has samples: Training:17949(65.60%), Val:2580(9.43%), Test:6832(24.97%)\n"
     ]
    }
   ],
   "source": [
    "# Checking the created (and actually used) metadata files in EVAR.\n",
    "# Note that the samples are split into 5-s unified-length audios with 2.5-s strides.\n",
    "\n",
    "split_csvs = ['evar/evar/metadata/circor1.csv', 'evar/evar/metadata/circor2.csv', 'evar/evar/metadata/circor3.csv']\n",
    "for f in split_csvs:\n",
    "    df = pd.read_csv(f)\n",
    "    Tr, V, Te = [sum(df.split == s) for s in ['train', 'valid', 'test']]\n",
    "    N = len(df)\n",
    "    print(f'Split {f} has samples: Training:{Tr}({Tr/N*100:.2f}%), Val:{V}({V/N*100:.2f}%), Test:{Te}({Te/N*100:.2f}%)')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### We finished preparation\n",
    "\n",
    "We are done with data and code preparation.\n",
    "\n",
    "Proceed to the notebook [1-Run-M2D.ipynb](1-Run-M2D.ipynb)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "ar",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.15"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
